When CORDIAL Becomes Friendly: Endowing the CORDIAL Corpus with a Syntactic Annotation Layer
Abstract
This paper reports on the syntactic annotation of a previously compiled and tagged corpus of European Portuguese (EP) dialects, the Syntax-oriented Corpus of Portuguese Dialects (CORDIAL-SIN). The parsed version of CORDIAL-SIN is intended to be a more efficient resource for studying dialect syntax, because it allows automated searches for syntactic constructions of interest. To achieve this goal we adopted a rich annotation system (the UPenn corpora annotation system), which encodes highly relevant syntactic information. The annotation produces tree representations, in the form of labelled parentheses, that are fully searchable with CorpusSearch, a search engine for parsed corpora (Randall, 2005-2007). The present paper focuses on CORDIAL-SIN annotation issues: it presents the general principles and guidelines of the adopted annotation system and describes the methodology, tools and procedures used to construct the parsed version of the corpus and to search it. The last section addresses the question of how an annotation system originally designed for Middle English can be adapted to meet the particular needs of a Portuguese corpus of dialectal speech.

1. The CORDIAL-SIN corpus

The Syntax-oriented Corpus of Portuguese Dialects (CORDIAL-SIN) is being built at the Linguistics Center of the University of Lisbon (CLUL) within a research project that aims to promote the study of European Portuguese dialect syntax by, among other things, implementing an online linguistic resource that meets the empirical demands of dialect syntax inquiry [1]. CORDIAL-SIN is a corpus of spoken dialectal EP that collects a geographically representative body of excerpts of spontaneous and semi-directed speech, selected from the oral interviews gathered by the Linguistic Variation Team at CLUL in the course of several Dialect Geography projects (ALEPG; ALEAç; ALLP; BA).
In its current state (the final state, in terms of extent), the corpus covers 42 locations within the continental and insular territory of Portugal and comprises about 600 000 words. Map 1 shows the geographical distribution of the CORDIAL-SIN locations. The corpus is available online, on the CORDIAL-SIN website, in three different formats [2]: (i) verbatim orthographic transcripts (which include phonetic and morphological variants as well as general spoken language phenomena), (ii) normalized orthographic transcripts (which eliminate the phonetic transcriptions of variants and the marked-up spoken language phenomena) and (iii) morphologically tagged texts (automatically tagged using the morphological tagger created by M. Finger for the Tycho Brahe Corpus of Historical Portuguese; cf. Finger, 1998, 2000).

[1] The CORDIAL-SIN project is supported by national funding (PRAXIS XXI/P/PLP/13046/1998; POSI/1999/PLP/33275; POCTI/LIN/46980/2002; PTDC/LIN/71559/2006).

[2] CORDIAL-SIN is part of a European network of dialect syntax, promoted by the ESF-funded project Edisyn, and in the near future will also be searchable (and interoperable with other dialectal corpora/databases) via the Edisyn Search Engine.

Map 1: Geographical distribution of CORDIAL-SIN locations

CORDIAL-SIN was compiled and tagged between 1999 and 2007; the syntactic annotation of the corpus is applied to the POS-tagged texts and is currently under way.

2. The CORDIAL-SIN syntactic annotation

2.1. The annotation system

At present, the main goal of the CORDIAL-SIN team is to make available a more efficient resource for studying (dialect) syntax, namely a parsed version of the corpus that allows searching not only for words or word